Abstract: Parity declustering allows faster reconstruction
of a disk array when some disk fails. Moreover, it guarantees
a uniform reconstruction workload on all surviving disks.
It has been shown that parity declustering for one-failure
tolerant array codes can be obtained via Balanced Incomplete
Block Designs. We extend this technique to array codes that
can tolerate an arbitrary number of disk failures via
t-designs.



I. Introduction 

RAID (Redundant Array of Independent Disks) has been widely
used as a large-scale and reliable storage system since its
introduction in 1988 [10]. However, the key limitation of the
first six levels of RAID (RAID-0 to RAID-5) is that the
system can recover from at most one disk failure. RAID-6 has
been proposed as a new RAID standard, which requires that any
one or two disk failures can be fixed. Several types of codes
that can correct two erasures have been proposed, such as the
Reed-Solomon (RS) code [19], EVENODD code [?], B-code [24],
X-code [25], and RDP code [5]. Codes that allow recovery from
more than two failures have also been investigated [11], [12],
[14]. The main limitation of RS codes is their high encoding
and decoding complexity, which involves computation over
finite fields. The other types of codes, called array codes,
are preferred by storage system designers because their
encoding and decoding require only XOR operations.

The majority of known array codes are MDS (Maximum Distance
Separable) codes. MDS array codes have optimal redundancy
(δ redundant disks are used in a δ-erasure-correcting array
code). The main issue with them is that when δ disks fail,
the data in every surviving disk has to be read for
reconstruction. This results in slow reconstruction when disk
capacities get larger and increases the possibility of
another failure, which renders the reconstruction impossible.
Moreover, as all disks must be fully accessed for recovery,
the system operates in its degraded mode: responses to user
requests take longer than usual.

Parity declustering was proposed by Muntz and Lui [17] as a
data layout technique that allows faster reconstruction and
uniform reconstruction workloads on surviving devices during
the reconstruction of one disk failure. Here, the
reconstruction workload refers to the amount of data that
needs to be accessed on the surviving disks in order to
reconstruct the data on the failed disk. Faster
reconstruction stems from a data layout that requires only a
partial access, instead of a full access, to each surviving
disk. In other words, the special layout allows
reconstruction of data on a failed disk without reading all
data in every surviving disk. Muntz and Lui suggested that
designing such a layout is a combinatorial block design
problem, but gave no further details. Holland and Gibson [13]
and Ng and Mattson [18] investigated the construction of
parity-declustered data layouts from Balanced Incomplete
Block Designs (BIBDs). The work of Reddy and Banerjee [20]
followed the same approach, although they focused on a
special type of BIBD.

For codes that can tolerate δ ≥ 2 disk failures, it is also
desirable to have a declustered-parity data layout. More
specifically, we want to design a data layout such that when
at most δ disks fail, only a portion of the content of each
healthy disk needs to be accessed for the recovery process.
Moreover, the reconstruction workload should be distributed
uniformly to all surviving disks. There have been several
works in which parity declustering for δ-failure tolerant
codes (δ ≥ 2) is considered, such as [?] and [?]. However,
none of them guarantees uniform workloads during the
reconstruction of more than one disk. Corbett [8] proposed
that two (or more) array codes of the same size can be
combined into a larger array that has almost uniform
reconstruction workloads when one or two disks fail. However,
Corbett's method only achieves uniform workloads among the
data disks, not over all surviving disks (data disks and
parity disks).
Moreover, his construction produces an array code of a 
prohibitively large size, which is at least ( n ™ 2 ) x n - 

We investigate the construction of declustered-parity
layouts for codes that tolerate t - 1 disk failures via
t-designs (t ≥ 2). In fact, BIBDs, which are used to
decluster parities for one-failure tolerant codes, are
2-designs. The main idea is to start with an array code of k
columns that has uniform workloads for the reconstruction of
every s ≤ t - 1 columns. Then, the k columns of this code are
spread out over n ≥ k disks, using the blocks of a t-design
(see Section II for all definitions). As a result, we obtain
an array code with n disks that possesses the following
properties. First, in order to recover any s ≤ t - 1 disks,
only a portion of the disk content, which is a design
parameter, must be read. Second, the reconstruction workload
is uniformly distributed to every surviving disk. Last, the
parity units are distributed evenly over all disks, which
eliminates hot spots during data updates. To the best of our
knowledge, this is the first work that extends the well-known
parity declustering technique (originally proposed for
one-failure tolerant codes) to δ-failure tolerant codes, for
any δ ≥ 1.

The paper is organized as follows. Necessary definitions and
notation are provided in Section II. In that section, we also
review the parity declustering technique for one-failure
tolerant codes based on BIBDs. We extend this technique to
two-failure tolerant codes via 3-designs in Section III. In
Section IV, we discuss the generalization of this idea to
codes that can tolerate δ > 2 disk failures. The paper is
concluded in Section V.

II. Preliminaries 

Disk arrays spread data across several disks and access 
them in parallel to increase data transfer rates and I/O 
rates. Disk arrays are, however, highly vulnerable to disk 
failures. An array with n disks is n times more likely to
fail than a single disk [10]. Adding redundancy to a disk
array is a natural solution to this problem. Units of data
on k disks are grouped together into parity groups (or
parity stripes). Each parity group consists of k - 1 data
units and one parity unit. The parity unit is calculated
by taking the XOR-sum of the data units in the same 
group. The parity unit must be updated whenever a data 
unit in its group is modified. Therefore, the parity units 
should be distributed across the array rather than all being 
located on a small subset of disks. Otherwise we would 
have the situation where some disks are always busy 
updating the parity units while the others are totally idle. 
Ideally, we want to have the same amount of parity units 
on every disk. This requirement guarantees that the parity 
update workload is uniformly distributed among all disks. 
Additionally, it is required that no two units from the 
same parity group are located on the same disk, so that 
the disk array can always be recovered from one disk 
failure. 
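As a minimal illustration of the XOR parity rule described above, the following Python sketch computes the parity unit of one parity group and rebuilds a lost data unit from the surviving units. The helper name xor_parity and the sample byte values are illustrative assumptions and do not come from the original text.

```python
from functools import reduce


def xor_parity(units):
    """Return the byte-wise XOR-sum of equally sized units."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*units))


# A parity group of size k = 4: three data units and one parity unit.
data = [b"\x0f\x10", b"\xf0\x01", b"\x33\x44"]
parity = xor_parity(data)

# Lose data unit 1: the XOR-sum of all surviving units restores it.
survivors = [data[0], data[2], parity]
recovered = xor_parity(survivors)
```

Because each unit of a group sits on a different disk, losing one disk removes at most one unit per group, which is exactly the case handled here.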
Fig. 1: An array code with no parity declustering



Let us consider the following example. Suppose there 
are four disks in the disk array. Each disk is divided into 
several units. They are either data units (D) or parity 
units (P). Each parity group consists of three data units 
and one parity unit (those that have the same index). The 
parity unit is equal to the XOR-sum of the data units in 
the same parity group. The array in Fig. 1 represents
the basic pattern of the data/parity layout in this disk 
array. This basic pattern is then repeated many times until 
every unit in each disk is covered by some pattern. This 
pattern of data/parity layout is called an array code for 
the disk array. Column i of the array code corresponds 
to Disk i in the disk array that employs the array code. 
A data/parity entry in Column i represents a data/parity 
unit in Disk i. Without loss of generality, we assume 
that the disk array consists of only one copy of the 
data/parity layout from the array code. In other words, 
we assume that the data/parity layout of the disk array 
looks completely the same as the data/parity layout of 
the array code. Then, throughout this work, we often use 
disks and columns, units and entries, interchangeably. 

The array code presented in Fig. 1 can recover one
missing column. Hence, the disk array that employs this 
array code can tolerate one disk failure. The reconstruc- 
tion process of the lost column (disk) requires access to 
all entries (units) in every surviving column (disk). 

The parity declustering technique for one-failure tolerant
array codes based on BIBDs was originally suggested by Muntz
and Lui [17] and investigated in detail by Holland and
Gibson [13], Ng and Mattson [18], and Reddy and Banerjee
[20]. Before describing this technique, we need the
definitions of t-designs and BIBDs.

Definition II.1. A t-(n, k, λ) design, a t-design in short,
is a pair (X, B) where X is a set of n points and B is a
collection of k-subsets of X (blocks) with the property that
every t-subset of X is contained in exactly λ blocks. A
2-(n, k, λ) design is also called a balanced incomplete block
design (BIBD).
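The defining property of a t-design can be checked by brute force on small instances. The sketch below is only an illustration: the function name design_index is hypothetical, the check does not verify that all blocks have the same size, and the test instance (all 4-subsets of a 5-point set) is chosen because it is easy to verify by hand.

```python
from itertools import combinations


def design_index(points, blocks, t):
    """Return lambda if every t-subset of points lies in the same number
    of blocks, otherwise None. Block sizes are not checked here."""
    counts = {
        subset: sum(1 for block in blocks if set(subset) <= set(block))
        for subset in combinations(sorted(points), t)
    }
    values = set(counts.values())
    return values.pop() if len(values) == 1 else None


# All 4-subsets of {0, ..., 4}: every pair of points lies in 3 blocks.
points = range(5)
blocks = list(combinations(points, 4))
lam = design_index(points, blocks, 2)
```

The exhaustive count grows quickly with n and t, so this check is meant for sanity-testing small designs rather than for searching for them.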

Given a 2-(n, k, λ) design, we associate disks with points
and parity groups with blocks. As an illustrative example,
consider a 2-(5, 4, 3) design with X = {0, 1, 2, 3, 4} and B
consisting of five blocks: {0, 1, 2, 3}, {0, 1, 2, 4},
{0, 1, 3, 4}, {0, 2, 3, 4}, {1, 2, 3, 4}. Each block
corresponds to one parity group. For instance, the block
{1, 2, 3, 4} corresponds to a parity group with the (three)
data units located in Disks 1, 2, and 3 and the parity unit
located in Disk 4. The data layout of the array code is
presented in Fig. 2. We can also balance the number of parity
units in every column by rotating the array in this figure
cyclically five times (see [13]).

Fig. 2: An array code with parity declustering

Since every two elements in the set {0, 1, 2, 3, 4} appear
in precisely three different blocks, every two disks share
three pairs of units, where the units in each pair belong to
the same parity group. Therefore, when one disk fails,
precisely three units in each surviving disk need to be read
for the recovery of units on the failed disk. Thus, instead
of reading 100% of the units in each surviving disk (as for
the array code in Fig. 1), the reconstruction process now
reads 75% of the units in each disk (three of the four units
on each disk). In other words, by adding one more disk to
the array, we can reduce the percentage of data that needs to
be read in each surviving disk for recovery. However, we lose
the MDS property of the code while spreading out the workload
over more disks. The layout now requires 1.25 disks worth of
parity (five parity units, at four units per disk) instead of
just one parity disk as in the previous example. Therefore,
the parity declustering technique can be considered as a way
to sacrifice storage efficiency for faster reconstruction.

The connection between the reconstruction of one-disk 
failure and a 2-design is elaborated further as follows. 
If a parity group G contains a unit from a disk D then 
D is said to be crossed by G. The reconstruction of one 
unit requires access to all other units in the same parity 
group. Therefore, in order to have uniform workloads 
during the reconstruction for one disk failure, every two 
disks must share the same number of pairs of units that 
are from the same parity groups. In other words, every 
two disks must be simultaneously crossed by the same 
number of parity groups. If disks and parity groups are
associated with points and blocks, respectively, then the
aforementioned property of the data layout becomes the
familiar requirement for a 2-design: every two points must be
simultaneously contained in the same number of blocks. Thus,
the parity declustering technique for one-failure tolerant
array codes can be summarized as follows:

Algorithm 1 ([17], [13], [18], [20])

• Input: n is the number of physical disks in the array
and k is the parity group size.

• Step 1: Choose a parity group G with k - 1 data
units and one parity unit.

• Step 2: Choose a 2-(n, k, λ) design V = (X, B).

• Step 3: For each block B_i = {b_{i,0}, ..., b_{i,k-1}} ∈ B,
0 ≤ i < |B|, create a parity group G_i as follows.
First, G_i must have the same data-parity pattern as
G. In other words, G_i has k - 1 data units and one
parity unit, and the parity unit is equal to the XOR-sum
of the data units. Second, the k - 1 data units of
G_i are located on disks with labels b_{i,0}, ..., b_{i,k-2}.
The parity unit of G_i is located on the disk with label
b_{i,k-1}.

• Output: The n-disk array with |B| parity groups
and their layouts according to Step 3.
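Step 3 of Algorithm 1 can be sketched in a few lines of Python. The function and variable names below (decluster, units_read, workload) are illustrative assumptions, and the sketch places the parity unit on the last label of each block without the additional rotation that balances parity across disks. It is run on the 2-(5, 4, 3) design used earlier in this section.

```python
def decluster(n, blocks):
    """One parity group per block; the last label of a block gets the
    parity unit. Returns disks[d] = list of (group index, 'D' or 'P')."""
    disks = [[] for _ in range(n)]
    for i, block in enumerate(blocks):
        for d in block[:-1]:
            disks[d].append((i, "D"))
        disks[block[-1]].append((i, "P"))
    return disks


# The 2-(5, 4, 3) design from the text.
blocks = [(0, 1, 2, 3), (0, 1, 2, 4), (0, 1, 3, 4), (0, 2, 3, 4), (1, 2, 3, 4)]
disks = decluster(5, blocks)


def units_read(failed, survivor):
    """Units on `survivor` that belong to groups crossing `failed`."""
    lost_groups = {g for g, _ in disks[failed]}
    return sum(1 for g, _ in disks[survivor] if g in lost_groups)


workload = {(f, s): units_read(f, s)
            for f in range(5) for s in range(5) if f != s}
units_per_disk = len(disks[0])               # four units on every disk
parity_disks = len(blocks) / units_per_disk  # five parity units / 4 per disk
```

Every (failed, survivor) pair yields the same workload of three units out of four, which matches the 75% figure quoted for this design, and the parity overhead comes out to 1.25 disks.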

In the next sections, we generalize this procedure to 
construct declustered-parity layouts for array codes that 
tolerate more than one disk failure. 

III. Parity Declustering for Two-Failure 
Tolerant Codes via 3-Designs 

To extend the parity declustering technique to two-failure
tolerant codes, we use balanced 2-parity groups instead of
parity groups.

A. δ-Parity Groups

Definition III.1. A δ-parity group is an MDS δ-failure
tolerant array code. More formally, a δ-parity group is an
m × k array that satisfies the following conditions:

(C1) it contains (k - δ)m data entries and δm parity
entries;

(C2) entries in at most δ columns can always be
reconstructed from the entries in other columns.

Moreover, if a δ-parity group also satisfies the two other
conditions

(C3) for the reconstruction of entries in at most δ
columns, the number of entries in every other column that
contribute to the calculation must always be the same;

(C4) the number of parity entries in every column must
be the same,

then it is said to be balanced. If a δ-parity group does
not satisfy either (C3) or (C4) then it is said to be
unbalanced. We refer to k as the size of the δ-parity
group.

Note that the condition (C3) depends on the particular
reconstruction algorithm used for the δ-parity group.
Therefore, a δ-parity group can be balanced or unbalanced
depending on which reconstruction algorithm is employed. In
fact, all MDS two-failure tolerant array codes, such as
Reed-Solomon (RS) codes [19], EVENODD [?], RDP [5], B-codes
[24], P-codes [?], and X-codes [25], are 2-parity groups.
However, they are not balanced in their original form. The
vertical codes (B-, P-, X-codes), which contain both data and
parity units in each column, equipped with their conventional
reconstruction algorithms for one failure, satisfy (C4) but
not (C3). The horizontal codes (RS, EVENODD, RDP), in which
each column contains either only data units or only parity
units, in their original form satisfy neither (C3) nor (C4).
The following example shows how to modify the existing MDS
horizontal codes to obtain balanced 2-parity groups.

Example III.2. We first consider RDP codes. Let p be a
prime. The RDP code for a (p + 1)-disk array is defined as a
(p - 1) × (p + 1) array [?] (see Fig. 3). Its first p - 1
columns (disks) store data entries (units) and its last two
columns (disks) store parity entries (units). The first
parity column (P-column) stores the row-parity entries; each
such entry is equal to the XOR-sum of the data entries in the
same row. The second parity column (Q-column) stores the
diagonal-parity entries; each such entry is equal to the
XOR-sum of the data and row-parity entries along some
diagonal of the array. Note that one diagonal is not used
(called the missing diagonal in [?]).
Fig. 3: RDP array with p = 5 (reproduced from [23])



The conventional reconstruction rule for RDP is as follows.
Suppose one column is lost. If it is a data column (D), then
each of its entries can be recovered by taking the XOR-sum of
the data entries in the other data columns (D) and the
row-parity entry in the P-column that belong to the same row.
In this way, the Q-column plays no role in the reconstruction
of one lost data column. If the P-column or the Q-column is
lost, then its entries can be reconstructed by recalculating
the parities according to the encoding rule of RDP. Note that
the reconstruction of the P-column does not require access to
the Q-column, and vice versa. Hence, the RDP array with its
conventional reconstruction rule does not qualify as a
balanced 2-parity group. However, we can transform an RDP
array into a balanced 2-parity group as follows. Let us first
label the data columns by 'D' and the parity columns by 'P'
and 'Q', respectively. As an example, the RDP array (p = 5)
in its simplified form is depicted in Fig. 4.
Fig. 4: An RDP array with p = 5 (simplified layout)

We consider all possible ways to arrange the P-column and
the Q-column among all k columns (k = p + 1). There are
k(k - 1) such arrangements. If k = 6 then there are
6 × 5 = 30 such arrangements. For each arrangement of the P-
and Q-columns, we obtain a new array Q_i, 0 ≤ i < k(k - 1).
We juxtapose all these arrays vertically to obtain a new
array Q, which contains k(k - 1) times as many rows as the
original RDP array (see Fig. 5 for the case when k = 6).
Fig. 5: A balanced 2-parity group obtained from an RDP
array (p = 5)

Our goal now is to show that the array Q constructed above,
together with RDP's conventional reconstruction rule, is, in
general, a balanced 2-parity group. The array Q obviously
satisfies (C1), (C2), and (C4). We only need to verify
Condition (C3) for Q. To recover two missing columns, every
other column has to be read in full. Hence, the
reconstruction workload for two missing columns is already
uniform across the columns. To recover one missing column,
each of the other columns in a given arrangement either has
to be read in full or is not accessed at all. Therefore, it
suffices to regard each column D, P, or Q as a single entry,
or more precisely, a column-entry, in Q, and use the
reconstruction rule for RDP as shown in Fig. 6. Those
column-entries of Q correspond to column-units on physical
disks, where each column-unit is actually a column of
data/parity units. We thus examine the number of
column-entries (instead of entries) in each column of Q that
must be read for column recovery.

We refer to the group of rows in Q that contains the entries
from each array Q_i as an extended-row of Q. Then Q has
k(k - 1) extended-rows. For instance, in Fig. 5, Q has 30
extended-rows.

For two distinct columns i and j of Q, we define the
following quantities:

• r_DQ: the number of extended-rows that have a D at
Column i and a Q at Column j;

• r_PQ: the number of extended-rows that have a P at
Column i and a Q at Column j;

• r_QP: the number of extended-rows that have a Q at
Column i and a P at Column j.

Fig. 6: Reconstruction rule for an RDP array

According to the reconstruction rule of RDP arrays (Fig. 6),
these extended-rows (that define r_DQ, r_PQ, and r_QP as
above) are precisely the extended-rows of Q on which the
recovery of the column-entry in the ith column does not
require access to the column-entry in the jth column.
Therefore, the number of column-entries to be read in column
j during the reconstruction of column i is precisely

k(k - 1) - r_DQ - r_PQ - r_QP.

Hence, if r_DQ, r_PQ, and r_QP are all constant over every
pair (i, j), then the reconstruction workload is uniformly
distributed to all surviving columns. As the extended-rows of
Q correspond to all possible arrangements of the P-, Q-, and
D-columns, we have

r_DQ = k - 2, r_PQ = 1, r_QP = 1,

for every pair of columns i and j of Q, so every surviving
column contributes k(k - 1) - k = k(k - 2) column-entries.
Therefore, Q satisfies (C3).
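The counting argument can be confirmed by enumerating all arrangements for a small k. The sketch below encodes the RDP rule as the three (lost, other) role pairs for which the other column is not read, namely (D, Q), (P, Q), and (Q, P), as stated in the text. The names NOT_READ, extended_rows, and column_entries_read are illustrative assumptions.

```python
from itertools import permutations

# (role of lost column, role of other column) pairs where the other
# column is NOT read, following the RDP rule described in the text.
NOT_READ = {("D", "Q"), ("P", "Q"), ("Q", "P")}


def extended_rows(k):
    """All k(k-1) placements of the P- and Q-columns among k columns."""
    rows = []
    for p_col, q_col in permutations(range(k), 2):
        row = ["D"] * k
        row[p_col], row[q_col] = "P", "Q"
        rows.append(row)
    return rows


def column_entries_read(rows, lost, other):
    """Column-entries of `other` read while rebuilding column `lost`."""
    return sum(1 for row in rows if (row[lost], row[other]) not in NOT_READ)


k = 6
rows = extended_rows(k)
loads = {column_entries_read(rows, i, j)
         for i in range(k) for j in range(k) if i != j}
```

For k = 6 the set loads contains the single value 24, that is, 30 extended-rows minus the (k - 2) + 1 + 1 = 6 rows that are skipped, for every ordered pair of columns.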



The same modification also turns an EVENODD array code or an
RS code into a balanced 2-parity group. In fact, this method
works for every horizontal array code, as long as it has
separate parity columns (P- and Q-columns) and a
reconstruction rule that can be clearly stated in a table
similar to the one in Fig. 6.

Note that a simple cyclic rotation does not turn a
horizontal array code into a balanced 2-parity group. For
instance, consider the array obtained by vertically
juxtaposing all cyclic rotations of an RDP array with p = 5,
as in Fig. 7. Suppose the first column is lost. For
reconstruction, according to the rule illustrated in Fig. 6,
one needs to access five column-entries in the second column
and only four column-entries in the last column. Hence, the
reconstruction workload is not distributed uniformly among
the surviving columns.
Fig. 7: Rotated RDP array does not form a balanced 2-parity
group (p = 5)
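The imbalance of the rotated layout is easy to reproduce. The following sketch builds the six cyclic rotations of the simplified RDP layout D D D D P Q and counts the column-entries read in the second and the last column when the first column is lost. The direction of rotation and the variable names are assumptions made for illustration; the set of six rotations is the same either way.

```python
# (role of lost column, role of other column) pairs where the other
# column is NOT read, following the RDP rule described in the text.
NOT_READ = {("D", "Q"), ("P", "Q"), ("Q", "P")}

base = ["D", "D", "D", "D", "P", "Q"]  # simplified RDP layout, p = 5

# One extended-row per cyclic rotation of the base layout.
rows = [base[s:] + base[:s] for s in range(len(base))]


def column_entries_read(lost, other):
    """Column-entries of `other` read while rebuilding column `lost`."""
    return sum(1 for row in rows if (row[lost], row[other]) not in NOT_READ)


second = column_entries_read(0, 1)  # workload on the second column
last = column_entries_read(0, 5)    # workload on the last column
```

The two counts differ (five versus four), so cyclic rotation alone does not satisfy (C3).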



Definition III.3. The balanced 2-parity group obtained from
an RDP array code as in Example III.2 is called a (balanced)
RDP 2-parity group. An EVENODD 2-parity group and an RS
2-parity group are defined in the same way.

Lemma III.4. Suppose Q is a balanced RDP/EVENODD/RS 2-parity
group of size k. Then to reconstruct a missing column of Q,
one needs to read a portion $\frac{k-2}{k-1}$ of the total
content of each other column. In fact, this also holds for
every horizontal code that has the same reconstruction rule
as the RDP code.

Proof: Appendix A. ■

B. Design of Declustered-Parity Layouts via 3-Designs 

Recall that the size k of a 2-parity group Q is its 
number of columns. Each column of Q corresponds to 
a column-unit in a physical disk, which is a column of 
data/parity units. 
Fig. 8: Simplified layout of a 2-parity group



The following algorithm extends Algorithm 1 to construct
declustered-parity layouts for two-failure tolerant codes.

Algorithm 2

• Input: n is the number of physical disks in the array
and k is the parity group size.

• Step 1: Choose a balanced 2-parity group Q of size k.

• Step 2: Choose a 3-(n, k, λ) design V = (X, B).

• Step 3: For each block B_i = {b_{i,0}, ..., b_{i,k-1}} ∈ B,
0 ≤ i < |B|, create a balanced 2-parity group Q_i as
follows. First, Q_i must have the same data-parity pattern
and the same reconstruction rule as Q. Second, the k columns
of Q_i are located on disks with labels b_{i,0}, ..., b_{i,k-1}.

• Output: The n-disk array with |B| 2-parity groups
and their layouts according to Step 3.

Note that even though the groups Q_i, 0 ≤ i < |B|, all have
the same data-parity pattern as Q, they store independent
sets of data/parity units on the physical disks. The steps in
Algorithm 2 are illustrated in the following example.

Example III.5. Suppose Q is a balanced 2-parity group of
size four. For instance, Q can be obtained from a 2 × 4 RDP
array (p = 3) using the method described in Example III.2.
Then the simplified layout of Q is as shown in Fig. 9. Each
column of Q actually corresponds to a column of
24 = 2 × (4 × 3) parity/data units on a physical disk.

Fig. 9: A balanced 2-parity group of size four

Suppose we have n = 8 physical disks. Consider the following
3-(8, 4, 1) design V = (X, B), where

X = {0, 1, 2, 3, 4, 5, 6, 7},

and

B = {{0, 1, 2, 3}, {0, 1, 4, 5}, {0, 1, 6, 7}, {0, 2, 4, 6},
     {0, 2, 5, 7}, {0, 3, 4, 7}, {0, 3, 5, 6}, {4, 5, 6, 7},
     {2, 3, 6, 7}, {2, 3, 4, 5}, {1, 3, 5, 7}, {1, 3, 4, 6},
     {1, 2, 5, 6}, {1, 2, 4, 7}}.

The resulting array code C is depicted in Fig. 10. There are
14 2-parity groups in C, namely Q_i, 0 ≤ i < 14. The 2-parity
group Q_i has its columns, labeled by i, spread across the
disks indexed by the elements of the block B_i ∈ B,
0 ≤ i < 14. For example, as B_13 = {1, 2, 4, 7}, the columns
of Q_13, labeled by 13, are located on Disk 1, Disk 2,
Disk 4, and Disk 7. As each Q_i is a 24 × 4 array and each
disk holds columns from seven groups, C is a 168 × 8 array
(168 = 7 × 24).
Fig. 10: The resulting array code C
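The numbers in this example can be checked mechanically. The sketch below verifies that the 14 listed blocks form a 3-(8, 4, 1) design and recomputes the depth of C from the number of groups that cross each disk. The variable names are illustrative assumptions, and blocks are indexed from 0 in the order listed.

```python
from itertools import combinations

blocks = [
    {0, 1, 2, 3}, {0, 1, 4, 5}, {0, 1, 6, 7}, {0, 2, 4, 6},
    {0, 2, 5, 7}, {0, 3, 4, 7}, {0, 3, 5, 6}, {4, 5, 6, 7},
    {2, 3, 6, 7}, {2, 3, 4, 5}, {1, 3, 5, 7}, {1, 3, 4, 6},
    {1, 2, 5, 6}, {1, 2, 4, 7},
]
n, m = 8, 24  # 8 disks; each 2-parity group is a 24 x 4 array

# Every 3-subset of points must lie in exactly lambda = 1 block.
triple_counts = {
    sum(1 for b in blocks if set(t) <= b)
    for t in combinations(range(n), 3)
}

# Groups whose columns land on each disk, and the resulting depth of C.
groups_on_disk = [[i for i, b in enumerate(blocks) if d in b]
                  for d in range(n)]
depth = {len(groups) * m for groups in groups_on_disk}
```

Each disk is crossed by seven groups, so the depth is 7 × 24 = 168 on every disk, in agreement with the array size stated in the example.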



Theorem III.6. Algorithm 2 produces an array code C that
satisfies the following properties:

(P1) it can tolerate up to two simultaneous disk failures;

(P2) when one or two disks fail, the reconstruction workload
is evenly distributed to all surviving disks;

(P3) every column of C has the same number of parity units.

Proof: Appendix B. ■

We now give a high-level explanation of why 3-designs and
balanced 2-parity groups work well together to produce
declustered-parity layouts for two-failure tolerant codes.

First, let us examine again the application of 2-designs to
one-failure tolerant codes. When one disk fails, it is
required that all other disks contribute the same amount of
data access during the reconstruction process. In other
words, we examine pairs of disks (one failed, one surviving)
and want to make sure that all of these pairs have the same
number of related data/parity units (Fig. 11). (Related units
are units that belong to the same parity group.) On the other
hand, in a 2-design, a similar-looking condition is applied
to pairs of points: every pair of points must belong to the
same number of blocks. That is how the connection between
one-failure tolerant codes and 2-designs is established.



Fig. 11: Requirement for any pair of disks (one surviving
disk, one lost disk): the number of related units between
2 disks must be a constant

The problem of designing declustered-parity layouts for
two-failure tolerant codes has a similar requirement. It is
required that when one or two disks fail, all surviving disks
contribute the same amount of data access during the
reconstruction process. Suppose two disks fail. We then
examine groups of three disks (two failed, one surviving) and
want to make sure that all of these groups have the same
number of "related" data/parity units (Fig. 12). (We use a
different meaning here for "related units". See Appendix B
for more details.) If we consider a 3-design, the key
property is that every group of three points must be
contained in the same number of blocks. At first sight, it is
not clear how to translate this condition on points/blocks
back to the aforementioned condition on disks/groups.
However, one can do so with the help of some results in
design theory. More details can be found in the proof of
Theorem III.6 in Appendix B. Note also that, as a 3-design is
also a 2-design (see Corollary B.3), a uniform workload for
the reconstruction of one failed disk is automatically
guaranteed.

The balance of the 2-parity group used in Algorithm 2 
is another key condition to guarantee the balanced re- 
construction workload. In the following example, it is 
demonstrated that Algorithm 2 applied to an unbalanced 
2-parity group does not produce a code with this prop- 
erty. 

Fig. 12: Requirement for any group of three disks: the number
of related units between 3 disks must be a constant

Example III.7. Suppose the 3-design V in Example III.5 and
Q, an RDP 2 × 4 array, are used in Algorithm 2. Note that Q
is an unbalanced 2-parity group with the reconstruction rule
given in Fig. 6. The layout of the resulting code is depicted
in Fig. 13.

Fig. 13: Unbalanced input leads to unbalanced output

Suppose Disk 0 and Disk 1 fail. Let us examine the number of
column-units on Disk 4 and Disk 6, respectively, that need to
be accessed for the reconstruction of Disk 0 and Disk 1.
According to the reconstruction rule of each group (Fig. 6),
five column-units on Disk 4 must be accessed, whereas only
one column-unit on Disk 6 must be accessed (see Fig. 14).
Therefore, the workload for the reconstruction of the first
two disks is not uniformly distributed to the surviving
disks.
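The counts in this example can be reproduced with a short script. The figure showing the layout is not available here, so the sketch assumes that, within each block, the roles D, D, P, Q are assigned to the disk labels in ascending order; this assumption is consistent with the statement that group 3 has its P-column on Disk 4 and its Q-column on Disk 6. The names ROLES and column_units_read are illustrative.

```python
# (role of lost column, role of other column) pairs where the other
# column is NOT read, following the RDP rule described in the text.
NOT_READ = {("D", "Q"), ("P", "Q"), ("Q", "P")}

blocks = [
    (0, 1, 2, 3), (0, 1, 4, 5), (0, 1, 6, 7), (0, 2, 4, 6),
    (0, 2, 5, 7), (0, 3, 4, 7), (0, 3, 5, 6), (4, 5, 6, 7),
    (2, 3, 6, 7), (2, 3, 4, 5), (1, 3, 5, 7), (1, 3, 4, 6),
    (1, 2, 5, 6), (1, 2, 4, 7),
]
ROLES = ("D", "D", "P", "Q")  # assumed: roles follow ascending disk label


def column_units_read(failed, survivor):
    """Column-units on `survivor` read to rebuild the `failed` disks."""
    count = 0
    for block in blocks:
        if survivor not in block:
            continue
        role = dict(zip(block, ROLES))
        lost = [d for d in failed if d in block]
        if len(lost) == 2:
            count += 1  # two losses in one group: every column is read
        elif len(lost) == 1 and (role[lost[0]], role[survivor]) not in NOT_READ:
            count += 1
    return count


on_disk_4 = column_units_read((0, 1), 4)
on_disk_6 = column_units_read((0, 1), 6)
```

Under the stated assumption the script returns five column-units for Disk 4 and one for Disk 6, the same imbalance as reported above.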

The reason why Algorithm 2 fails to produce the desired
array code in the above example can be explained as follows.
Even though the 3-design spreads out the columns of the
2-parity groups evenly among the disks, the columns within
each group do not play the same role in the reconstruction of
a lost column. More specifically, the P-column and the
corresponding Q-column have different roles in the
reconstruction of a D-column. Indeed, according to the
reconstruction rule for RDP arrays stated in Fig. 6, the
reconstruction of a D-column requires access to the P-column,
but not to the Q-column. For example, even though both Disk 4
and Disk 6 contain column-units from Q_3, the column-unit P_3
on Disk 4 must be read, while the column-unit Q_3 on Disk 6
is not read (see Fig. 14). If a balanced 2-parity group is
used instead, we will not have this problem, as every column
in a balanced 2-parity group plays the same role in the
reconstruction of a missing column.

Fig. 14: Related column-units on Disks 0, 1, 4, and 6. The
underlined entries are those which must be accessed for the
reconstruction of Disks 0 and 1. An 'X' in a row labeled by
Group Q_i and in a column labeled by Disk j means that Disk j
does not contain any column-unit from Q_i.

C. Storage Efficiency and Reconstruction Workload
Trade-Off

In this subsection we examine the trade-off (of the
declustered-parity layout produced by Algorithm 2) between
storage efficiency and the workload on every disk during the
reconstruction of disk failures. If an M × n array code C
contains x parity units and Mn - x data units, then we say
that the number of disks worth of parity in C is x/M. The
quantity n - x/M is called the number of disks worth of data
of C. In other words, C uses x/M disks to store parities and
n - x/M disks to store data.

Another attribute of the array code C produced by Algorithm 2
that needs to be examined is the number of rows M, or depth,
of C. The depth of C counts how many units there are in each
of its columns. An array with fewer rows results in a smaller
table being stored in memory and a faster table look-up.
Furthermore, a code with a smaller depth provides a better
local balance (see Schwabe and Sutherland [21]). The depth of
C depends on n, k, and λ, as shown in the following theorem.
When n and k are fixed, the bigger the index λ is, the more
rows C has. Therefore, 3-designs with smaller λ are
preferred.

Theorem III.8. The array code C produced by Algorithm 2
satisfies the following properties:

(P4) C has
$$\frac{\lambda (n-1)(n-2)}{(k-1)(k-2)}\, m$$
rows, where m is the number of rows in the 2-parity group Q;

(P5) C has $\frac{(k-2)n}{k}$ disks worth of data and
$\frac{2n}{k}$ disks worth of parity.

Moreover, if an RDP/EVENODD/RS 2-parity group is used in
Algorithm 2 then C also satisfies the following properties:

(P6) To reconstruct one failed disk, a portion
$\frac{k-2}{n-1}$ of the total content of each surviving disk
needs to be read;

(P7) To reconstruct two failed disks, a portion
$\frac{(k-2)(2n-k-1)}{(n-1)(n-2)}$ of the total content of
each surviving disk needs to be read.

Proof: Appendix C. ■
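The quantities in the theorem are convenient to tabulate with exact rational arithmetic. The sketch below is written under stated assumptions rather than taken from the paper: the closed forms are the ones obtained by a standard counting argument (each disk lies in λ(n-1)(n-2)/((k-1)(k-2)) blocks, each group devotes 2 of its k columns' worth of units to parity, and a balanced RDP-like group reads a fraction (k-2)/(k-1) of each surviving column after one failure). The function name layout_metrics and the dictionary keys are hypothetical.

```python
from fractions import Fraction


def layout_metrics(n, k, lam):
    """Depth and workload figures for a 3-(n, k, lam) design combined with
    a balanced RDP-like 2-parity group (assumed closed forms)."""
    groups_per_disk = Fraction(lam * (n - 1) * (n - 2), (k - 1) * (k - 2))
    parity = Fraction(2 * n, k)
    return {
        "depth_over_m": groups_per_disk,   # depth of C divided by m
        "parity_disks": parity,            # disks worth of parity
        "data_disks": n - parity,          # disks worth of data
        "read_one_failure": Fraction(k - 2, n - 1),
        "read_two_failures": Fraction((k - 2) * (2 * n - k - 1),
                                      (n - 1) * (n - 2)),
    }


example = layout_metrics(8, 4, 1)   # the 3-(8, 4, 1) design used earlier
no_declustering = layout_metrics(20, 20, 1)
```

For the 3-(8, 4, 1) design this gives a depth of 7m, four disks worth of parity, and read fractions of 2/7 and 11/21 for one and two failures. With k = n the two-failure fraction equals 1, meaning every surviving disk is read in full.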

When k = n, that is, when there is no parity declustering
involved, Theorem III.8 states the familiar facts about an
MDS two-failure tolerant array code: C has
$n - 2 = \frac{(n-2)n}{n}$ disks worth of data and
$2 = \frac{2n}{n}$ disks worth of parity; to reconstruct one
failed disk, a portion $\frac{n-2}{n-1}$ of the total content
of each surviving disk needs to be read; and to reconstruct
two failed disks, each surviving disk needs to be read in
full ($1 = \frac{(n-2)(2n-n-1)}{(n-1)(n-2)}$). Note that the
second property does not hold for most known MDS array codes
in their original formulations. In fact, it only holds for
these codes after some transformation is applied (see
Example III.2).

Example III.9. In this example, we fix the number of
disks in the array to be $n = 20$. The parity group size
$k$ varies from 3 to 20. The availability of a particular
3-$(n, k, \lambda)$ design can be found in [7, Part II, Table 4.37].
Note that a t-design, t > 3, is also a 3-design. In this
table we choose $\lambda$ to be the smallest possible.

The third and fourth columns show the percentage of
data/parity units that have to be read on each surviving
disk in order to reconstruct one and two failed disks,
respectively. The fifth column presents the number of
parity disks to be used when the corresponding parity
group size k is used. The figures in the third, fourth,
and fifth columns only depend on n and k. As expected,
when k increases, the percentage of units that have to
be accessed for disk recovery increases, and the number
of parity disks used decreases. Thus, one has to trade
the storage efficiency for the reconstruction workload (on
each disk): increasing storage efficiency, which is good,
leads to increasing workload during disk recovery, which
is bad, and vice versa. One extreme is when k = n,
where there is no parity declustering. The array code
becomes a normal MDS array code, with two disks worth
of parities and 100% load on every surviving disk during
the reconstruction of two failed disks.

Fig. 15: Different parity group sizes lead to array codes
with different performances (n = 20)

The figures in the last column are the depth of the
resulting array code divided by $m$, the number of rows of
the balanced 2-parity group Q (see Algorithm 2). These
figures depend on $n$, $k$, and $\lambda$.

The ingredient balanced 2-parity groups Q of size $k$
($3 \le k \le 20$) can be constructed using the method
presented in Example III.2. This method can be applied
to an RS code of length $k$ for an arbitrary $k \ge 3$
to obtain a $(k(k-1)) \times k$ balanced 2-parity group
($m = k(k-1)$). For an EVENODD code, this method
produces a $(k(k-1)(k-3)) \times k$ balanced 2-parity
group ($m = k(k-1)(k-3)$), for every $k = p + 2$
where $p$ is a prime. For an RDP code [9], this method
produces a $(k(k-1)(k-2)) \times k$ balanced 2-parity group
($m = k(k-1)(k-2)$), for every $k = p + 1$ where $p$ is
a prime.
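The three row counts above can be tabulated for a given parity group size. The sketch below assumes only the formulas and the primality conditions stated in this paragraph; the helper names `is_prime` and `balanced_group_rows` are hypothetical.

```python
def is_prime(p):
    """Trial-division primality test, sufficient for small parity group sizes."""
    if p < 2:
        return False
    i = 2
    while i * i <= p:
        if p % i == 0:
            return False
        i += 1
    return True

def balanced_group_rows(k):
    """Map each applicable code family to the row count m of its balanced 2-parity group.

    Assumes k >= 3. RS applies to every k, EVENODD needs k = p + 2,
    and RDP needs k = p + 1, with p prime.
    """
    rows = {"RS": k * (k - 1)}
    if is_prime(k - 2):
        rows["EVENODD"] = k * (k - 1) * (k - 3)
    if is_prime(k - 1):
        rows["RDP"] = k * (k - 1) * (k - 2)
    return rows

for k in (4, 7, 10):
    print(k, balanced_group_rows(k))
```

For sizes such as $k = 10$, where neither $k - 1$ nor $k - 2$ is prime, only the RS-based group is available under these conditions.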

Remark III.10. Corbett introduced in his patent [8] a
method to mix $n/2$ data disks from one array code with
$n/2$ data disks from another code to produce an array
code that has $n$ data disks. When one or two disks fail,
the reconstruction workload is distributed evenly to all
surviving data disks (but not to all data/parity disks). His
method actually uses the complete 3-$(n, n/2, \lambda)$ design
$(X, \mathcal{B})$ where all $(n/2)$-subsets of $X$ are blocks. In fact,
any self-complementary 3-design would work well with
his construction (a design is self-complementary if
$B \in \mathcal{B}$ if and only if $X \setminus B \in \mathcal{B}$). The
Hadamard 3-$(n, n/2, n/4 - 1)$ design is such a design
(see [18]). Using a Hadamard design results in an array
code of only $m(n - 1)$ rows, where $m$ is the depth of
the original array codes. By contrast, the construction in
[8] produces an array code of an extremely large depth

"»(»%)• 
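To get a feel for the gap between the two designs, one can count how many blocks pass through a single point in each of them. The sketch below uses the standard replication-number formula for a 3-design (the same quantity that appears in (P4)); the function names are hypothetical, and the way Corbett's own construction turns block counts into rows is not modelled here.

```python
from math import comb

def replication_number(n, k, lam):
    """Number of blocks through one point of a 3-(n, k, lam) design."""
    num = lam * (n - 1) * (n - 2)
    den = (k - 1) * (k - 2)
    if num % den != 0:
        raise ValueError("parameters are not admissible for a 3-design")
    return num // den

def hadamard_replication(n):
    """Blocks through a point in the Hadamard 3-(n, n/2, n/4 - 1) design (n divisible by 4)."""
    return replication_number(n, n // 2, n // 4 - 1)

def complete_replication(n):
    """Blocks through a point when all (n/2)-subsets are blocks (n even, n >= 6)."""
    k = n // 2
    lam = comb(n - 3, k - 3)  # every triple lies in C(n-3, k-3) of the k-subsets
    return replication_number(n, k, lam)

for n in (8, 12, 16, 20):
    print(n, hadamard_replication(n), complete_replication(n))
```

The Hadamard count grows linearly in $n$, whereas the count for the complete design is a central binomial coefficient and grows exponentially.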



IV. Parity Declustering for (t-1)-Failure-Tolerant Codes via t-Designs

The generalization of Algorithm 2 to Algorithm 3
below, which works for $(t-1)$-failure tolerant codes (t > 2),
is straightforward.

Algorithm 3

• Input: $n$ is the number of physical disks in the array
and $k$ is the parity group size.

• Step 1: Choose a balanced $(t-1)$-parity group Q
of size $k$.

• Step 2: Choose a $t$-$(n, k, \lambda)$ design $\mathcal{D} = (X, \mathcal{B})$.

• Step 3: For each block $B_i = \{b_{i,1}, \ldots, b_{i,k}\} \in \mathcal{B}$,
$1 \le i \le |\mathcal{B}|$, create a balanced $(t-1)$-parity group
$Q_i$ as follows. Firstly, $Q_i$ must have the same data-parity
pattern and the same reconstruction rule as Q.
Secondly, the $k$ columns of $Q_i$ are located on disks
with labels $b_{i,1}, \ldots, b_{i,k}$.

• Output: The $n$-disk array with $|\mathcal{B}|$ parity groups
and their layouts according to Step 3.
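The placement rule in Step 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the function name `layout` is hypothetical, disks are labelled 0 to n-1, the j-th column of each group is placed on the j-th smallest label of its block, and the trivial design (all k-subsets) stands in for a t-design from the tables.

```python
from itertools import combinations

def layout(n, blocks):
    """Place column j of parity group Q_i on the j-th disk of block B_i.

    Returns a dict: disk label -> list of (group index, column index)
    pairs, i.e. the column-units stacked on that disk from top to bottom.
    """
    disks = {d: [] for d in range(n)}
    for i, block in enumerate(blocks):
        for j, disk in enumerate(sorted(block)):
            disks[disk].append((i, j))
    return disks

# Assumption for the demo: the trivial design on n = 6 points with k = 4.
n, k = 6, 4
blocks = list(combinations(range(n), k))
disks = layout(n, blocks)

print(len(blocks))                         # number of parity groups
print({d: len(units) for d, units in disks.items()})
```

Every disk ends up holding the same number of column-units, and no disk holds two columns of the same group, which is what the fault-tolerance argument relies on.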

Relevant t-designs can be found in [7, Part II, Table 4.37]
and in the references therein. The ingredient
balanced $(t-1)$-parity group Q in Algorithm 3 can be
constructed by applying the method in Example III.2 to
any MDS horizontal array code that tolerates $t - 1$ disk
failures. More specifically, suppose that the original array
code has $k - t + 1$ data columns (D) and $t - 1$ parity
columns, namely $P_i$-columns, $i = 1, \ldots, t - 1$. There
are $(t-1)!\binom{k}{t-1}$ ways to arrange the parity columns of
the original array. For each such arrangement, we
obtain a new array. By juxtaposing vertically all of these
$(t-1)!\binom{k}{t-1}$ arrays, we obtain a balanced $(t-1)$-parity
group. The proof that the above method works for general
t is almost the same as for t = 3. For example, for t = 4,
instead of considering just $r_{DQ}$, $r_{PQ}$, and $r_{QP}$, we now
need to consider other quantities, such as $r_{P_1P_2}$, $r_{DP_1P_2}$,
or $r_{P_1P_2P_3}$. They are, in fact, all constants. Therefore, the
arguments go the same way as in Example III.2. We will
not provide a detailed proof here.
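The count $(t-1)!\binom{k}{t-1}$ is the number of ways to assign $t - 1$ distinct parity columns to distinct positions among $k$ columns. A small sketch that enumerates these arrangements and checks the count follows; the function name `parity_arrangements` and the tuple encoding are assumptions made for illustration.

```python
from itertools import permutations
from math import comb, factorial

def parity_arrangements(k, t):
    """All placements of the t-1 distinct parity columns among k positions.

    Each arrangement is a tuple whose i-th entry is the position
    (0-based) assigned to parity column P_{i+1}; the remaining
    positions hold the data columns.
    """
    return list(permutations(range(k), t - 1))

for k, t in ((5, 3), (6, 4)):
    arrangements = parity_arrangements(k, t)
    expected = factorial(t - 1) * comb(k, t - 1)
    print(k, t, len(arrangements), expected)
```

For $t = 3$ the count reduces to $k(k-1)$, which matches the number of stacked copies used for the RS-based 2-parity group.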

Apart from the well-known RS codes, some other
known MDS horizontal $(t-1)$-failure tolerant codes (t >
3) were studied by Blömer et al., Blaum et al., and
Huang and Xu [14].

V. Conclusion 

We propose a way to extend the parity declustering
technique to multiple-failure tolerant array codes based
on balanced $(t-1)$-parity groups and t-designs (t > 2).
Balanced $(t-1)$-parity groups can be obtained from any
known horizontal array code that tolerates up to $t - 1$
disk failures. Moreover, t-designs are very well-studied
combinatorial objects in the theory of Combinatorial
Designs. Therefore, one of the advantages of our approach
is that we can exploit the rich literature from both Erasure
Codes theory and Combinatorial Designs theory.

The second advantage of the approach based on
t-designs is its flexibility. By simply using different
t-designs in the array code construction, one can obtain a
variety of different trade-offs between storage efficiency
and the recovery time. Note that $\mathcal{D} = (X, \mathcal{B})$, where $\mathcal{B}$
consists of all $k$-subsets of $X$, is a t-design (called the
trivial design) for any $1 \le t \le k \le n$. Therefore, for any
given number of disks $n$ and any given parity group size
$k \le n$, there always exists a $t$-$(n, k, \lambda)$ design for some
$\lambda$.
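The claim about the trivial design can be checked by brute force for small parameters. The sketch below counts, for every t-subset, the number of k-subsets containing it, and confirms that the count is the same for all t-subsets; the function name `trivial_design_index` is hypothetical.

```python
from itertools import combinations
from math import comb

def trivial_design_index(n, k, t):
    """Return the index lambda of the design whose blocks are all k-subsets of an n-set.

    Raises ValueError if the t-subsets are not covered equally often,
    which would mean the block set is not a t-design.
    """
    blocks = [frozenset(b) for b in combinations(range(n), k)]
    counts = {
        sum(1 for b in blocks if b.issuperset(T))
        for T in combinations(range(n), t)
    }
    if len(counts) != 1:
        raise ValueError("not a t-design")
    return counts.pop()

print(trivial_design_index(7, 4, 3), comb(7 - 3, 4 - 3))
```

The index of the trivial design is $\binom{n-t}{k-t}$, which grows quickly with $n$; this is why smaller, non-trivial designs are preferable when they exist.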

One disadvantage of this approach is that sometimes
the smallest t-design still has an unacceptably large index
$\lambda$, which leads to an impractically deep array code.
A natural question to ask is whether the depth of the
array code, in those cases, can be reduced if we relax
some requirements on the array code. A similar question,
aimed at one-failure tolerant array codes, has
already been discussed by Schwabe and Sutherland [21].
Another open question concerns the construction of a
balanced $(t-1)$-parity group. In this work, we show that
horizontal array codes can be employed to produce such
parity groups. However, the question of whether vertical
array codes can also be useful is still open.

Appendix A
Proof of Lemma III.4

From Example III.2, for the recovery of one lost
column of Q, one needs to read

$$k(k-1) - r_{DQ} - r_{PQ} - r_{QP} = k(k-1) - (k-2) - 1 - 1 = k(k-2)$$

column-entries in each of the other columns. As each
column contains $k(k-1)$ column-entries, the portion of
content of each column that has to be accessed is

$$\frac{k(k-2)}{k(k-1)} = \frac{k-2}{k-1}.$$


Appendix B
Proof of Theorem III.6

A. Known Results from Design Theory 

The following results from Design Theory are useful 
in our discussion. 

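Theorem B.1. ([22, Theorem 9.7]) Suppose that $(X, \mathcal{B})$
is a $t$-$(n, k, \lambda)$ design. Suppose that $Y, Z \subseteq X$, where
$Y \cap Z = \emptyset$, $|Y| = i$, $|Z| = j$, and $i + j \le t$. Then there
are exactly

$$\lambda_i^{(j)} = \frac{\lambda \binom{n-i-j}{k-i}}{\binom{n-t}{k-t}}$$

blocks in $\mathcal{B}$ that contain all the points in $Y$ and none of
the points in $Z$. In particular, for $t = 3$,

$$|\mathcal{B}| = \lambda_0^{(0)} = \frac{\lambda n(n-1)(n-2)}{k(k-1)(k-2)}.$$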


Corollary B.2. Suppose that $(X, \mathcal{B})$ is a 3-$(n, k, \lambda)$
design. Then any point $x$ of $X$ is contained in precisely

$$\lambda_1 = \frac{\lambda(n-1)(n-2)}{(k-1)(k-2)} \qquad (1)$$

blocks.

Proof: Let $t = 3$, $Y = \{x\}$ and $Z = \emptyset$, and apply
Theorem B.1. ■

Corollary B.3. Suppose that $(X, \mathcal{B})$ is a 3-$(n, k, \lambda)$
design. Then any two distinct points $x$ and $y$ in $X$ are
contained in precisely

$$\lambda_2 = \frac{\lambda(n-2)}{k-2} \qquad (2)$$

blocks.

Proof: Let $t = 3$, $Y = \{x, y\}$ and $Z = \emptyset$, and apply
Theorem B.1. ■

Corollary B.4. Suppose that $(X, \mathcal{B})$ is a 3-$(n, k, \lambda)$
design. Suppose that $x$, $y$, and $z$ are three distinct points
in $X$. Then the number of blocks in $\mathcal{B}$ that contain both
$x$ and $y$ but not $z$ is

$$\lambda_2^{(1)} = \frac{\lambda(n-k)}{k-2}. \qquad (3)$$

Proof: Let $t = 3$, $Y = \{x, y\}$, $Z = \{z\}$, and apply
Theorem B.1. ■
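The three block counts can be verified on a small concrete design. The sketch below uses the 3-(8, 4, 1) design formed by the 14 affine planes of the three-dimensional space over GF(2); this choice of example, and the helper names `planes_of_ag32` and `count_blocks`, are assumptions made for illustration and do not appear in the paper.

```python
from fractions import Fraction
from itertools import combinations

def planes_of_ag32():
    """Blocks {x : <a, x> = b over GF(2)} for nonzero a; points are 3-bit integers."""
    blocks = []
    for a in range(1, 8):
        for b in (0, 1):
            blocks.append(frozenset(
                x for x in range(8) if bin(a & x).count("1") % 2 == b))
    return blocks

def count_blocks(blocks, inside, outside=()):
    """Count blocks containing every point of `inside` and no point of `outside`."""
    return sum(1 for B in blocks
               if set(inside) <= B and not (set(outside) & B))

blocks = planes_of_ag32()
n, k, lam = 8, 4, 1
lam1 = Fraction(lam * (n - 1) * (n - 2), (k - 1) * (k - 2))  # equation (1)
lam2 = Fraction(lam * (n - 2), k - 2)                        # equation (2)
lam21 = Fraction(lam * (n - k), k - 2)                       # equation (3)

is_3_design = all(count_blocks(blocks, T) == lam
                  for T in combinations(range(8), 3))
ok1 = all(count_blocks(blocks, (x,)) == lam1 for x in range(8))
ok2 = all(count_blocks(blocks, P) == lam2 for P in combinations(range(8), 2))
ok21 = all(count_blocks(blocks, (x, y), (z,)) == lam21
           for x, y in combinations(range(8), 2)
           for z in range(8) if z not in (x, y))
print(is_3_design, ok1, ok2, ok21)
```

Here the formulas give 7 blocks through a point, 3 through a pair, and 2 through a pair that avoid a third point, and the brute-force counts agree.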



Now we are ready to prove Theorem III.6. Let C be
the array code produced by Algorithm 2. Suppose that in
Q (and hence in every $Q_i$), to recover one (two) missing
column(s), precisely $\tau_1$ ($\tau_2$) entries have to be read from
every other column.

B. Proof of C satisfying (P3) 

First note that due to Corollary B.2, each column of
C contains precisely $\lambda_1 = \frac{\lambda(n-1)(n-2)}{(k-1)(k-2)}$ column-units.
Therefore, each column of C contains the same number
of units. Also, as each column of $Q_i$ (that is, each
column-unit of C) contains the same number of parity
units for all $1 \le i \le |\mathcal{B}|$, each column of C contains the
same number of parity units. Thus C satisfies (P3).

C. Proof of C satisfying (P1)

According to Definition III.1, each 2-parity group can
recover up to two missing columns. Moreover, according
to Algorithm 2, no two columns of the same group
are located (as column-units) in the same column of C.
Therefore, C can tolerate up to two disk failures. Thus C
satisfies (P1).

D. Proof of C satisfying (P2) 

Suppose Disk y of C fails. Let x be an arbitrary
surviving disk of C. According to Corollary B.3, in
points/blocks language, there are $\lambda_2$ blocks in $\mathcal{B}$ that
contain both points x and y. Translated to disks/groups
language, there are $\lambda_2$ pairs of column-units $(u_x, u_y)$,
where $u_x$ is in Disk x, $u_y$ is in Disk y, and $u_x$ and $u_y$
are from the same 2-parity group. For such a pair of
column-units $(u_x, u_y)$, in order to recover $u_y$, precisely
$\tau_1$ units have to be read from $u_x$. Therefore, $\lambda_2\tau_1$ units
have to be read from Disk x for the recovery of Disk y.
This number of units is a constant for every pair of Disks
x and y. Hence, when one disk fails, the reconstruction
workload is uniformly distributed to all surviving disks.

Fig. 16: One disk fails

Now suppose that Disk y and Disk z of C fail. Let x
be an arbitrary surviving disk of C. A column-unit $u_x$ in
Disk x is involved in the reconstruction of the two failed
disks if and only if one of the following three cases holds.

• Case 1: There exist column-units $u_y$ in Disk y and
$u_z$ in Disk z so that $u_x$, $u_y$, and $u_z$ all belong to
some 2-parity group $Q_i$. In this case, as $Q_i$ loses
two columns, namely $u_y$ and $u_z$, $\tau_2$ units have to be
read from $u_x$ for the recovery of the lost columns.
According to the definition of a 3-design, there are
precisely $\lambda$ such triples $(u_x, u_y, u_z)$.



Fig. 17: Case 1



• Case 2: There exists a column-unit $u_y$ in Disk y
such that $u_x$ and $u_y$ belong to some 2-parity
group $Q_i$ and, moreover, none of the columns of
$Q_i$ are located in Disk z. In this case, as $Q_i$ loses
only one column, namely $u_y$, $\tau_1$ units have to be
read from $u_x$ for the recovery of this lost column.
According to Corollary B.4, there are precisely $\lambda_2^{(1)}$
such pairs $(u_x, u_y)$.



Fig. 18: Case 2

• Case 3: There exists a column-unit $u_z$ in Disk z
such that $u_x$ and $u_z$ belong to some 2-parity
group $Q_i$ and, moreover, none of the columns of
$Q_i$ are located in Disk y. In this case, as $Q_i$ loses
only one column, namely $u_z$, $\tau_1$ units have to be
read from $u_x$ for the recovery of the lost column.
According to Corollary B.4, there are precisely $\lambda_2^{(1)}$
such pairs $(u_x, u_z)$.

Fig. 19: Case 3

Therefore, in summary, when Disk y and Disk z fail,
the number of units to be read from Disk x for the
reconstruction is precisely

$$\lambda\tau_2 + 2\lambda_2^{(1)}\tau_1.$$

As this number is a constant for every three distinct disks
x, y, and z, we conclude that when two disks fail, the
reconstruction workload is evenly distributed across all
surviving disks.
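The three-case count can be replayed on a concrete design. The sketch below again uses the 14 affine planes of the three-dimensional space over GF(2) as a 3-(8, 4, 1) design, which is an example chosen for illustration rather than one used in the paper; the values `tau1 = 8` and `tau2 = 12` are illustrative per-column read counts, and `reads_from_disk` is a hypothetical helper.

```python
def planes_of_ag32():
    """Blocks {x : <a, x> = b over GF(2)} for nonzero a; points are 3-bit integers."""
    return [frozenset(x for x in range(8) if bin(a & x).count("1") % 2 == b)
            for a in range(1, 8) for b in (0, 1)]

def reads_from_disk(blocks, x, y, z, tau1, tau2):
    """Units read from surviving disk x when disks y and z have failed."""
    total = 0
    for B in blocks:
        if x not in B:
            continue                 # this group has no column on disk x
        lost = len(B & {y, z})       # columns of this group on failed disks
        if lost == 2:
            total += tau2            # Case 1: the group lost two columns
        elif lost == 1:
            total += tau1            # Cases 2 and 3: the group lost one column
    return total

blocks = planes_of_ag32()
tau1, tau2 = 8, 12                   # illustrative values only
loads = {reads_from_disk(blocks, x, y, z, tau1, tau2)
         for x in range(8) for y in range(8) for z in range(8)
         if len({x, y, z}) == 3}
lam, lam21 = 1, 2                    # lambda and lambda_2^(1) for this design
print(loads, lam * tau2 + 2 * lam21 * tau1)
```

The set `loads` contains a single value, equal to $\lambda\tau_2 + 2\lambda_2^{(1)}\tau_1$, for every choice of surviving disk and failed pair.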

Appendix C
Proof of Theorem III.8

Suppose the 2-parity group Q employed in Algorithm 2
has $m$ rows. Recall that $\tau_i$, $i = 1, 2$, denotes the number
of entries to be read from every other column when $i$
columns of Q are lost. If Q is an RDP/EVENODD/RS
2-parity group then $\tau_1$ and $\tau_2$ can be explicitly computed.
Indeed, according to Lemma III.4, we have

$$\tau_1 = m\,\frac{k-2}{k-1}. \qquad (4)$$

When two columns of Q are lost, all $k - 2$ other columns
have to be read in full for the recovery of the lost
columns. Therefore

$$\tau_2 = m. \qquad (5)$$

A. Proof of C satisfying (P4)

According to Corollary B.2, each column of C contains
precisely $\lambda_1$ column-entries. Moreover, each of these
column-entries consists of $m$ entries. Therefore, each
column of C consists of

$$M = m\lambda_1 = \frac{\lambda(n-1)(n-2)}{(k-1)(k-2)}\, m$$

entries.

B. Proof of C satisfying (P5)

We need to show that C has $\frac{(k-2)n}{k}$ disks worth of
data and $\frac{2n}{k}$ disks worth of parity.

There are $|\mathcal{B}|$ balanced 2-parity groups and each
group consists of $2m$ parity units (see Definition III.1).
Therefore, the total number of parity units in C is $2m|\mathcal{B}|$.
Therefore, C contains

$$\frac{2m|\mathcal{B}|}{M} = \frac{2|\mathcal{B}|}{\lambda_1} = 2 \cdot \frac{\lambda n(n-1)(n-2)}{k(k-1)(k-2)} \cdot \frac{(k-1)(k-2)}{\lambda(n-1)(n-2)} = \frac{2n}{k}$$

disks worth of parity. We deduce that C contains

$$n - \frac{2n}{k} = \frac{(k-2)n}{k}$$

disks worth of data.

C. Proof of C satisfying (P6)

We need to prove that if Q is an RDP/EVENODD/RS
2-parity group then in order to reconstruct one failed
disk, a portion $\frac{k-2}{n-1}$ of the total content of each surviving
disk needs to be read.

Suppose one column of C is lost. According to
Appendix B, $\lambda_2\tau_1$ entries must be read from each other
column for the reconstruction of the missing column.
Since each column of C consists of $M$ entries, a portion

$$\frac{\lambda_2\tau_1}{M} = \frac{\frac{\lambda(n-2)}{k-2} \cdot m\,\frac{k-2}{k-1}}{\frac{\lambda(n-1)(n-2)}{(k-1)(k-2)}\, m} = \frac{k-2}{n-1}$$

of the total content of each surviving disk must be read.

D. Proof of C satisfying (P7)

We need to show that if Q is an RDP/EVENODD/RS
2-parity group then in order to reconstruct two failed
disks, a portion $\frac{(k-2)(2n-k-1)}{(n-1)(n-2)}$ of the total content of
each surviving disk needs to be read.

Suppose two columns of C are lost. According to
Appendix B, $\lambda\tau_2 + 2\lambda_2^{(1)}\tau_1$ entries must be read from
each other column for the reconstruction of the two
missing columns. Thus, a portion

$$\frac{\lambda\tau_2 + 2\lambda_2^{(1)}\tau_1}{M} \qquad (6)$$

of the total content of each surviving column needs to be
read for the recovery of two columns of C. Substituting
(1), (3), (4), and (5) into (6), the ratio in this equation
can be simplified to

$$\frac{(k-2)(2n-k-1)}{(n-1)(n-2)}.$$
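The simplifications used in this appendix can be double-checked with exact rational arithmetic. The sketch below rebuilds each ratio from $\lambda_1$, $\lambda_2$, $\lambda_2^{(1)}$, $\tau_1$, $\tau_2$, and $M$ and compares it with the closed form; the function name `check_appendix_c` is hypothetical, and the parameters are treated as formal numbers, so no claim is made that a design exists for every tuple tried.

```python
from fractions import Fraction

def check_appendix_c(n, k, lam, m):
    """Return True if all four simplified expressions match their raw definitions."""
    lam1 = Fraction(lam * (n - 1) * (n - 2), (k - 1) * (k - 2))   # equation (1)
    lam2 = Fraction(lam * (n - 2), k - 2)                         # equation (2)
    lam21 = Fraction(lam * (n - k), k - 2)                        # equation (3)
    tau1 = Fraction(m * (k - 2), k - 1)                           # equation (4)
    tau2 = m                                                      # equation (5)
    M = m * lam1                                                  # depth, (P4)
    num_blocks = Fraction(lam * n * (n - 1) * (n - 2),
                          k * (k - 1) * (k - 2))                  # |B|

    parity = 2 * m * num_blocks / M                               # (P5)
    one_failure = lam2 * tau1 / M                                 # (P6)
    two_failures = (lam * tau2 + 2 * lam21 * tau1) / M            # (P7), ratio (6)

    return (parity == Fraction(2 * n, k)
            and n - parity == Fraction((k - 2) * n, k)
            and one_failure == Fraction(k - 2, n - 1)
            and two_failures == Fraction((k - 2) * (2 * n - k - 1),
                                         (n - 1) * (n - 2)))

all_ok = all(check_appendix_c(n, k, lam, m)
             for n in range(4, 30)
             for k in range(3, n + 1)
             for lam in (1, 2, 3)
             for m in (1, 2, 3))
print(all_ok)
```

Note that $\lambda$ and $m$ cancel in every ratio, which is why the portions in (P6) and (P7) and the disk counts in (P5) depend only on $n$ and $k$.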



